bigdata_uclfandomcom-20200215-history
Similar items
'Part 1 : Slides' 'Part 2 : Project' In this project, we try to find similar tweets in a database. We start with a .csv file including the tweet, the language, the date of submission, the localisation, etc. So, we need to extract the tweet and make a certain selection, because, we have more than 1 billion tweets. We faced some difficulties with the accents in french, and to reduce the number of tweets, we selected only the tweets in english and with a length bigger or equal to 15 characters. Indeed, it's more interesting to compare two tweets with more than 4 words. Our method to extract these messages is the following one: *********************************Code extraction********************************* file=open(r"TweetsData.csv","r", encoding="utf8") destination=open(r"MessagesENSup15.csv","w", encoding="utf8") i=0 longueur=15 DateMax='2014-01-31 23:59:59' DateMin='2013-11-01 00:00:00' langue='en' #'fr' dateL=3 lang=12 k=0 Message=[] for line in file: contenu=file.readline() cut=contenu.split(',') taille=len(cut) if taille=DateMin): mess=cut1 j=2 while taille >longueur: mess='{0},{1}'.format(mess,cutj) j=j+1 taille=taille-1 if len(mess)<=15: pass else: Message.append(mess) k=k+1 destination.write("%s\n" % (mess)) else: pass destination.close() file.close() Part 2: Notion of similarity Part 3 There are two version of tweet's comparison, one where we compare one tweet to the others (one-many), and the other one when we want to find pairs of tweets that are similar in a whole set (many-many). One-many In the first case, we have a tweet and we want to find tweets that resemble our. This occurs, for example, when we want to know if there are people who write the same thing as us. In this case, we have no choice but to compare the tweet T to all others by the k-shingles technique. Example: In this part we selected only the tweets in english and with a length bigger or equal to 10 characters. Take the tweet T: "justin bieber i love you" Using the technique of k-shingles, we compare this tweet to the others with the Jaccard Similarity. Here is an histogram of similarity of the other tweets with T. We didn't represent those with Jaccard Similarity equal to zero. We clearly see that the most of tweets are textually different than T. In fact, from ???? tweets in english, we found * 1046372 tweets with a similarity between 0% and 1% * 15033 with a similarity between 10% and 20% * 1733 with a similarity between 20% and 30% * 438 with a similarity between 30% and 40% * 101 with a similarity between 40% and 30% * 56 with a similarity between 50% and 60% * 0 with a similarity greater than 60% Here some of them: * "little happy meal" : 0% * "@Louis_Tomlinson I fucking love you to ?? : 11% * "@_MIKEw_ i love you!" : 24.2% * "@justinbieber follow me i love you": 38.5% * "@justinbieber love you xx" : 40% * "@justinbieber love you..." : 56% We can observe that even a Jaccard Similarity of 55% is significant here. We emphasize that in this project, we study the textual similarity between two tweets and not the content. Many-many In the second situation, we have a set of tweets and we want to find pairs of similar tweets. In this case, the number of pairs is too great for us to evaluate the similarity of each pair. In order to decrease the space needed to store a set, we could hash sets of shingles to four bytes each. The set representing a tweet is then the set of integers that are bucket numbers of one or more k-shingles that appear in the tweet. But even in this case, the space needed is still roughly four times the space taken by the tweet. A second idea is to use a technique called "minhashing" which compress tweets into small signatures and preserve the expected similarity of any pair of tweet. We can visualise a collection of sets as their "characteristic matrix". In this matrix, the columns correspond to the sets and the rows correspond to elements of the universal set. To "minhash" a set represented in this way, we pick a permutation of the rows. Then the minhash value is the number of the first row in which the column has a 1 (in the permuted order). We have an important property: "The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets". To see that, we randomly take 10 tweets and we calculate the minhash signature with 100 permutation. We can simulate the effect of a random permutation by a random hash function that hash maps row numbers to as many buckets as there are rows.